fix(cluster): resolve DNS failures on systemd-resolved hosts #516
Open
brianwtaylor wants to merge 2 commits into NVIDIA:main
Docker's embedded DNS at 127.0.0.11 is only reachable from the container's own network namespace. k3s pods in child namespaces cannot reach it, causing silent DNS failures on Ubuntu and other systemd-resolved hosts where /etc/resolv.conf contains 127.0.0.53.

Sniff upstream DNS resolvers from the host in the Rust bootstrap crate by reading /run/systemd/resolve/resolv.conf (systemd-resolved only; intentionally does NOT read /etc/resolv.conf to avoid bypassing Docker Desktop's DNAT proxy on macOS/Windows). Filter loopback addresses (127.x.x.x and ::1) and pass the result to the container as the UPSTREAM_DNS env var. Skip DNS sniffing for remote deploys where the local host's resolvers would be wrong.

The entrypoint checks UPSTREAM_DNS first, falling back to /etc/resolv.conf inside the container for manual launches. This follows the existing pattern used by registry config, SSH gateway, GPU support, and image tags.

Closes NVIDIA#437

Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>
drew
reviewed
Mar 21, 2026
Collaborator
This seems not necessary now, right?
Author
good call. no need for these tests with the logic living properly in rust. apologies for the duplicate PR. lost the first while deploying these changes in my test environment yesterday.
Drop deploy/docker/tests/test-dns-resolvers.sh: the resolver logic now lives in the Rust bootstrap crate with cargo test coverage, making the standalone shell harness redundant.

Signed-off-by: Brian Taylor <brian.taylor818@gmail.com>
Supersedes #478
@drew: reworked per your review. The bootstrap crate now sniffs resolvers and passes them as an `UPSTREAM_DNS` env var. No system files are mounted into the container.

Summary
- Sniff upstream DNS resolvers from `/run/systemd/resolve/resolv.conf` (systemd-resolved hosts only)
- Filter loopback addresses and pass the result to the container as the `UPSTREAM_DNS` env var
- Entrypoint checks `UPSTREAM_DNS` first, falls back to `/etc/resolv.conf` for manual launches

Closes #437
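The resolver sniffing summarized above (read `/run/systemd/resolve/resolv.conf`, drop loopback entries, keep the rest) could look roughly like the sketch below. This is an illustration only, not the code in `docker.rs`: the name `parse_upstream_dns` and its signature are hypothetical, and it takes the file contents as a string so the filtering can be exercised without touching the filesystem.

```rust
use std::net::IpAddr;

/// Hypothetical helper (not the PR's actual code): extract non-loopback
/// nameservers from the text of a resolv.conf-style file.
fn parse_upstream_dns(contents: &str) -> Vec<String> {
    contents
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            // Only `nameserver <addr>` lines matter; comments, blank lines,
            // and `search`/`options` lines are dropped here.
            if parts.next()? != "nameserver" {
                return None;
            }
            // Addresses that do not parse as plain IPv4/IPv6 (for example
            // ones carrying a `%zone` suffix) are skipped in this sketch.
            let addr: IpAddr = parts.next()?.parse().ok()?;
            // `is_loopback` covers 127.0.0.0/8 and ::1.
            if addr.is_loopback() {
                None
            } else {
                Some(addr.to_string())
            }
        })
        .collect()
}
```

A caller would presumably feed it the result of `std::fs::read_to_string("/run/systemd/resolve/resolv.conf")` and treat a read error (file absent, so no systemd-resolved) as "no upstream resolvers".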
Changes
- `crates/openshell-bootstrap/src/docker.rs`: Add `resolve_upstream_dns()` that reads `/run/systemd/resolve/resolv.conf`, filters loopback addresses, and returns real upstream resolvers. Pass them as the `UPSTREAM_DNS` env var to the cluster container (skipped for remote deploys). Includes unit tests.
- `deploy/docker/cluster-entrypoint.sh`: Add `get_upstream_resolvers()` that reads the `UPSTREAM_DNS` env var (priority) or falls back to `/etc/resolv.conf`. When upstream resolvers are found, write them directly to the k3s resolv.conf instead of relying on the DNAT proxy. Improve DNS verification logging on failure.
- `deploy/docker/tests/test-dns-resolvers.sh`: Shell-level tests for the entrypoint resolver logic.

Root Cause
Docker's embedded DNS at `127.0.0.11` is only reachable from the container's own network namespace. The existing DNAT rules forward to this loopback address, but k3s pods run in child network namespaces where the forwarded packets are dropped as martian packets. On systemd-resolved hosts, `/etc/resolv.conf` contains `127.0.0.53` (another loopback), so the fallback also fails silently.

DNS Flow: Before vs After
Testing
cgroupns=host)
=== VALIDATION TOPOLOGY ===
=== WHAT EACH NODE PROVED DURING VALIDATION ===
Node A ─── "Does the fix break anything that already works?"
Captured baseline iptables, TLS certs, and DNS state.
All comparisons showed zero drift.
Node B ─── "Does the new code handle edge-case input safely?"
Node C ─── "Does the fix affect macOS hosts?"
No systemd-resolved → no UPSTREAM_DNS set → no change.
Existing DNAT proxy path untouched.
Node D ─── "Does the fix affect Windows/WSL2 hosts?"
No systemd-resolved → no UPSTREAM_DNS set → no change.
Existing DNAT proxy path untouched.
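
The "no systemd-resolved → no UPSTREAM_DNS set → no change" behavior checked on Nodes C and D, together with the remote-deploy skip, comes down to a small decision about whether to set the env var at all. A minimal sketch of that decision follows. The function name `upstream_dns_env` is hypothetical, and the comma-separated value format is an assumption; this page does not state how the real crate encodes multiple resolvers.

```rust
/// Hypothetical helper (not the PR's actual code): decide whether the
/// cluster container should receive an UPSTREAM_DNS env var.
///
/// `resolvers` is `None` when the systemd-resolved file could not be read
/// (for example on macOS or Windows/WSL2 hosts without systemd-resolved).
fn upstream_dns_env(resolvers: Option<&[String]>, is_remote: bool) -> Option<(String, String)> {
    // For remote deploys the local host's resolvers would be wrong.
    if is_remote {
        return None;
    }
    let resolvers = resolvers?;
    // Nothing left after loopback filtering: leave the existing path alone.
    if resolvers.is_empty() {
        return None;
    }
    // ASSUMPTION: comma-separated list; the real format is not shown here.
    Some(("UPSTREAM_DNS".to_string(), resolvers.join(",")))
}
```

Returning `None` in every "not applicable" case leaves the container environment unchanged, so the entrypoint falls through to its existing `/etc/resolv.conf` and DNAT proxy handling.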
Automated Tests
Checklist